# Multimodal Processing

Bart Large Empathetic Dialogues

A model built with the Transformers library; its specific purpose and functionality are not documented.

Large Language Model

Openclip ViT H 14 FARE2

A robust image encoder based on the Transformers library, focused on image feature extraction.

Large Language Model

Mixtex Finetune

MixTex base_ZhEn is an image-to-text model supporting both Chinese and English, released under the MIT License.

Image-to-Text Supports Multiple Languages

Gemma 3 Glitter 4B

An optimized model based on Gemma 3 4B that uses the same data mixture as Glitter 12B.

Large Language Model

Ola-7B is a multimodal large language model jointly developed by Tencent, Tsinghua University, and Nanyang Technological University. Based on the Qwen2.5 architecture, it supports processing text, image, video, and audio inputs and generates text outputs.

Multimodal Fusion Supports Multiple Languages

Florence 2 FT DocVQA

A document visual question answering model fine-tuned from Florence-2-base, designed for question answering over document images.

Transformers English

Longvu Llama3 2 1B

LongVU is a spatio-temporal adaptive compression technology designed for long video language understanding, aiming to efficiently process long video content and enhance language comprehension.

Oryx-1.5-7B is a 7B-parameter model developed based on the Qwen2.5 language model, supporting a 32K token context window and specializing in efficiently processing visual inputs of arbitrary spatial dimensions and durations.

Text-to-Video Supports Multiple Languages

Longvu Llama3 2 3B

LongVU is a spatio-temporal adaptive compression technology for long video language understanding, designed to efficiently process long video content.

H2ovl Mississippi 800m

An 800M-parameter vision-language model from H2O.ai, specializing in OCR and document understanding.

Transformers English

Florence 2 DocVQA

A version of Microsoft's Florence-2 model fine-tuned for one day on 5% of the Docmatix dataset, suitable for image-text understanding tasks.

Florence 2 Large Florence 2 Large Nsfw Pretrain Gt

A model built with the Transformers library; its specific functions and uses are not documented.

Large Language Model

Ucmt Sam On Depth

A mask generation model implemented in PyTorch and pushed to the Hub via PyTorchModelHubMixin.

Image Segmentation

Ecot Openvla 7b Oxe

A pretrained Transformer model for robotic control tasks, supporting basic functions such as motion planning and object grasping.

Large Language Model

Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.

Icon Captioning Model

This is an image caption generation model based on the BLIP architecture, specifically designed to generate text descriptions for icons or simple images.

Fine Tuned Rvl Cdip

A fine-tuned version of microsoft/layoutlmv3-base for document image classification, achieving an F1 score of 0.8177 on the evaluation set.

Text Recognition

Interpret Cxr Impression Baseline

This model can convert medical images (such as X-rays) into descriptive text to assist in medical diagnosis.

Output LayoutLMv3 V7

A document understanding model fine-tuned from microsoft/layoutlmv3-base, excelling in document layout analysis tasks.

Text Recognition

Donut Base Handwriting Recognition

A handwriting recognition model fine-tuned from naver-clova-ix/donut-base.

Text Recognition

Llava Maid 7B DPO GGUF

LLaVA is a large language and vision assistant model capable of handling multimodal tasks involving images and text.

Docllm Baichuan2 7b

DocLLM_reimplementation is a project that reimplements a large language model for document understanding tasks, with the aim of improving document comprehension.

Large Language Model

JinghuiLuAstronaut

Built on the UniChart architecture, this model converts charts into structured tables. The generated tables use specific delimiters to mark row and column boundaries.

Transformers English
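
The chart-to-table entry above says the generated tables use delimiters for rows and columns. A minimal sketch of how such output could be parsed back into rows and cells follows. The listing does not name the delimiters, so the `&` row separator, the `|` column separator, the function name `parse_linearized_table`, and the sample string are all assumptions for illustration.

```python
def parse_linearized_table(text, row_sep="&", col_sep="|"):
    """Split a delimiter-encoded table string into a list of rows of cells.

    row_sep and col_sep are assumed values; check the model card for the
    delimiters the model actually emits.
    """
    rows = []
    for raw_row in text.split(row_sep):
        cells = [cell.strip() for cell in raw_row.split(col_sep)]
        # Skip rows that are entirely empty (for example after a trailing separator).
        if any(cells):
            rows.append(cells)
    return rows


# Hypothetical model output, not taken from a real run.
example = "Year | Sales & 2020 | 10 & 2021 | 15"
table = parse_linearized_table(example)

# Treat the first row as the header and pair it with each data row.
header, body = table[0], table[1:]
records = [dict(zip(header, row)) for row in body]
```

Keeping the separators as parameters means the same helper can be reused if a different model emits other delimiters.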

This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on the generator dataset, suitable for document understanding and layout analysis tasks.

Large Language Model

Git Base Next Refined

An image-to-text model fine-tuned from microsoft/git-base.

Large Language Model

Transformers Other

An image-to-text model fine-tuned from microsoft/git-base.

Transformers Other

This model is outdated. It is recommended to use the official Nougat model. Nougat is an advanced vision-language model focused on document understanding and analysis.

Git Base Fashion

An image-to-text model fine-tuned from microsoft/git-base, specialized for the fashion domain.

Transformers Other

Donut Trained Example 2

A model fine-tuned from naver-clova-ix/donut-base; its specific purpose is not clearly stated.

Large Language Model

A model fine-tuned from naver-clova-ix/donut-base; its specific uses and functions are not documented.

Wavlm Bert Fusion S Emotion Russian Resd

A multimodal fusion model based on WavLM and BERT, suitable for joint speech and text task processing.

Speech Recognition

DePlot is a visual-language reasoning model that converts chart images into linearized tables, enabling few-shot reasoning when combined with large language models.

Transformers Supports Multiple Languages
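
The DePlot entry above describes pairing a linearized table with a large language model for few-shot reasoning. A minimal sketch of that second stage follows, assuming the table has already been extracted as text. The function name `build_few_shot_prompt`, the prompt layout, and the sample data are illustrative assumptions, not DePlot's actual prompt format.

```python
def build_few_shot_prompt(examples, table, question):
    """Assemble a few-shot prompt from worked examples plus a new table and question.

    examples is a list of (table_text, question, answer) tuples.
    """
    parts = []
    for ex_table, ex_question, ex_answer in examples:
        parts.append(
            f"Table:\n{ex_table}\nQuestion: {ex_question}\nAnswer: {ex_answer}"
        )
    # The final block leaves the answer open for the language model to complete.
    parts.append(f"Table:\n{table}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical tables and questions, written by hand for illustration.
examples = [
    ("Year | Sales\n2020 | 10\n2021 | 15", "What were sales in 2021?", "15"),
]
prompt = build_few_shot_prompt(
    examples,
    table="City | Population\nOslo | 700000\nBergen | 290000",
    question="Which city has the larger population?",
)
```

The resulting string would then be sent to whichever language model is in use; that call is left out here because it depends on the provider.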

A document understanding model fine-tuned from naver-clova-ix/donut-base, suitable for image folder datasets.

Text Recognition

A model fine-tuned from naver-clova-ix/donut-base; its specific purpose is not explicitly stated.

Large Language Model

Layoutlmv2 Base Uncased Finetuned Docvqa V2

This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased for document visual question answering tasks, focusing on processing text and layout information in document images.

Layoutlmv3 Finetuned Funsd

A document understanding model fine-tuned from microsoft/layoutlmv3-base on the nielsr/funsd-layoutlmv3 dataset.

Text Recognition

A model capable of converting image content into textual descriptions, suitable for various vision-language tasks.

Transformers English

A VisionEncoderDecoder model fine-tuned on the CORD-v2 dataset for document understanding tasks.

Text Recognition

Layoutlmv3 Finetuned Cord

A document understanding model based on LayoutLMv3 and fine-tuned on the CORD dataset, excelling in document token classification tasks.

Text Recognition

Layoutlmv3 Finetuned Funsd

A document understanding model based on LayoutLMv3-base and fine-tuned on the FUNSD dataset, excelling in token classification tasks for forms and documents.

Text Recognition

© 2025 AIbase